Visualization & Style Transfer

Shan-Hung Wu & DataLab
Fall 2021

This tutorial shows how to load and use a pretrained model from the TensorFlow library and discusses techniques for visualizing what the network represents in selected layers. In addition, we introduce an interesting work called neural style transfer, which uses deep learning to compose one image in the style of another.

Import and configure modules

Visualize Convolutional Neural Networks

Visualize the input

Define a function to load an image and limit its maximum dimension to 512 pixels.
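A minimal sketch of such a loader, assuming a helper named `load_img` and TensorFlow's standard image ops:

```python
import tensorflow as tf

def load_img(path_to_img, max_dim=512):
    """Load an image file and scale it so its longest side is max_dim pixels."""
    img = tf.io.read_file(path_to_img)
    img = tf.image.decode_image(img, channels=3, expand_animations=False)
    img = tf.image.convert_image_dtype(img, tf.float32)  # scale values to [0, 1]

    shape = tf.cast(tf.shape(img)[:-1], tf.float32)
    scale = max_dim / tf.reduce_max(shape)
    new_shape = tf.cast(shape * scale, tf.int32)

    img = tf.image.resize(img, new_shape)
    return img[tf.newaxis, :]  # add a batch dimension
```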

Create a simple function to display an image:

Load a pretrained network (VGG19)

We are going to visualize one of the most remarkable neural networks, VGG19, introduced in this paper and pretrained on ImageNet. VGG19 is known for its simplicity, using only 3×3 convolutional layers stacked on top of each other in increasing depth. The "19" in its name stands for the number of layers in the network. ImageNet is a large dataset used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The training dataset contains around 1.2 million images covering 1000 different types of objects. The pretrained network has learned to create useful representations of the data that differentiate between classes.

A pretrained network is useful and convenient for several further purposes, such as style transfer, transfer learning, fine-tuning, and so on. Generally, using a pretrained network saves a lot of time and also makes it easier to train a model on a more complex or smaller dataset.

Load VGG19 and test-run it on our image to make sure it is used correctly. The output of VGG19 is a vector of probabilities over 1000 categories.
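A sketch of this step; the resize to 224×224 and the random stand-in image are assumptions, while `preprocess_input` and `decode_predictions` are the standard `tf.keras.applications.vgg19` helpers:

```python
import tensorflow as tf

# Load VGG19 with its classification head, pretrained on ImageNet.
vgg = tf.keras.applications.VGG19(include_top=True, weights='imagenet')

# Stand-in for the loaded image: a float batch with values in [0, 1].
img = tf.random.uniform((1, 400, 600, 3))

# VGG19 expects 224x224 inputs preprocessed in Caffe style (BGR, mean-centered);
# preprocess_input handles that, given pixel values in [0, 255].
x = tf.image.resize(img, (224, 224)) * 255.0
x = tf.keras.applications.vgg19.preprocess_input(x)

probs = vgg(x)  # shape (1, 1000): probabilities over the ImageNet classes
top5 = tf.keras.applications.vgg19.decode_predictions(probs.numpy(), top=5)[0]
for _, name, p in top5:
    print(f'{name}: {p:.4f}')
```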

Obtain the top 5 predicted categories of the input image.

Let's first print out the detailed structure of VGG19. vgg.summary() shows the name, output shape, and number of parameters of each layer.

Visualize filters

Now we can visualize the weights of the convolution filters to help us understand what the neural network has learned. In neural network terminology, the learned filters are simply weights, yet because of their specialized two-dimensional structure, the weight values have a spatial relationship to each other, so plotting each filter as a two-dimensional image is (or can be) meaningful.

We can access the block of filters and the block of bias values through layer.get_weights(). In VGG19, all convolutional layers use 3×3 filters.
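For example, the first convolutional layer (named `block1_conv1` in the Keras summary) can be inspected like this; the normalization step anticipates the visualization below:

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')

# Read the kernels and biases of the first convolutional layer.
layer = vgg.get_layer('block1_conv1')
filters, biases = layer.get_weights()
print(filters.shape, biases.shape)  # (3, 3, 3, 64) (64,)

# Normalize filter values to [0, 1] so they can be plotted as images.
f_min, f_max = filters.min(), filters.max()
filters_norm = (filters - f_min) / (f_max - f_min)
```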

Let's look at every individual filter in the first convolutional layer. We will see all 64 filters in the block and plot each of their three channels. It is worth mentioning that the first convolutional layer has a total of 192 filter channels (64 filters × 3 channels). We can normalize their values to the range 0–1 to make them easy to visualize.

The dark squares indicate small or inhibitory weights and the light squares represent large or excitatory weights. Using this intuition, we can see that the filters on the first row detect a gradient from light in the top left to dark in the bottom right.

Visualize feature maps

The activation maps, called feature maps, capture the result of applying the filters to input, such as the input image or another feature map. The idea of visualizing a feature map for a specific input image would be to understand what features of the input are detected or preserved in the feature maps. The expectation would be that the feature maps close to the input detect small or fine-grained detail, whereas feature maps close to the output of the model capture more general features.

We can see that applying the filters in the first convolutional layer yields many versions of the input image with different features highlighted. For example, some highlight lines, while others focus on the background or the foreground.

Let's visualize the feature maps output from each block of the model. You might notice that the number of feature maps (i.e., the depth or number of channels) in deeper layers is much larger than 64, such as 256 or 512. Nevertheless, we can cap the number of feature maps visualized at 64 for consistency.
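One way to grab the per-block outputs is an intermediate-layer model; the random stand-in input and the choice of the first conv layer of each block are assumptions:

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')

# One output per block; layer names follow vgg.summary().
block_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                'block4_conv1', 'block5_conv1']
outputs = [vgg.get_layer(name).output for name in block_layers]
feature_model = tf.keras.Model(inputs=vgg.input, outputs=outputs)

x = tf.random.uniform((1, 224, 224, 3))  # stand-in for the preprocessed image
feature_maps = feature_model(x)
for name, fmap in zip(block_layers, feature_maps):
    print(name, fmap.shape)
```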

We can see that the feature maps closer to the input of the model capture a lot of fine detail in the image and that as we progress deeper into the model, the feature maps show less and less detail.

This pattern was to be expected, as the model abstracts the features of the image into more general concepts that can be used to make a classification. Although it is not clear from the final feature maps that the model saw the NTHU campus, we generally lose the ability to interpret these deeper feature maps.

Visualize gradients

Visualizing convolutional output is a pretty useful technique for shallow convolutional layers, but when we get into the deeper layers, it is hard to understand them just by looking at the convolutional output.

If we want to understand what the deeper layers are really doing, we can try to use backpropagation to show us the gradients of a particular neuron with respect to our input image. We will make a forward pass up to the layer that we are interested in, and then backpropagate to help us understand which pixels contributed the most to the activation of that layer.

We first create an operation which will find the maximum neurons among all activations in the required layer, and then calculate the gradient of that objective with respect to the input image.

Compute the gradient of maximum neurons among all activations in the required layer with respect to the input image.
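With `tf.GradientTape`, this amounts to the following; the choice of `block5_conv1` and the random stand-in image are assumptions:

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
layer_model = tf.keras.Model(vgg.input, vgg.get_layer('block5_conv1').output)

img = tf.random.uniform((1, 224, 224, 3))  # stand-in for the preprocessed image

with tf.GradientTape() as tape:
    tape.watch(img)                         # img is a plain tensor, so watch it
    activations = layer_model(img)
    objective = tf.reduce_max(activations)  # the maximum neuron in the layer

grads = tape.gradient(objective, img)
print(grads.shape)  # same shape as the input image
```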

The raw range of gradient values is hard to interpret. We can normalize the gradient so that we can view it in the normal range of color values. After normalizing the gradient values, let's visualize the original image and the output of the backpropagated gradient.

We can also visualize the gradient of any single feature map.

Guided-Backpropagation

As we can see above, the results are still hard to explain and not very satisfying. Every pixel influences the neuron via multiple hidden neurons.

Ideally, neurons act like detectors of particular image features. We are only interested in what image features the neuron detects, not in what kind of stuff it doesn’t detect. Therefore, when propagating the gradient, we set all the negative gradients to 0.

We call this method guided backpropagation, because it adds an additional guidance signal from the higher layers to usual backpropagation. This prevents backward flow of negative gradients, corresponding to the neurons which decrease the activation of the higher layer unit we aim to visualize. For more details, please refer to Striving for Simplicity: The All Convolutional Net, a nice work from J. T. Springenberg and A. Dosovitskiy et al.
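A possible TF2 sketch of guided backpropagation uses `tf.custom_gradient` to redefine the ReLU gradient; swapping each layer's `activation` attribute in place is one way to wire it in, and the layer choice is an assumption:

```python
import tensorflow as tf

@tf.custom_gradient
def guided_relu(x):
    def grad(dy):
        # Block gradients that are negative OR that flow through inactive units.
        return tf.cast(dy > 0, dy.dtype) * tf.cast(x > 0, x.dtype) * dy
    return tf.nn.relu(x), grad

# Replace the ReLU in every conv layer of a fresh VGG19 with the guided version.
vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
for layer in vgg.layers:
    if getattr(layer, 'activation', None) is tf.keras.activations.relu:
        layer.activation = guided_relu

gb_model = tf.keras.Model(vgg.input, vgg.get_layer('block5_conv1').output)

img = tf.random.uniform((1, 224, 224, 3))
with tf.GradientTape() as tape:
    tape.watch(img)
    objective = tf.reduce_max(gb_model(img))
guided_grads = tape.gradient(objective, img)
```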

A Neural Algorithm of Artistic Style

Visualizing neural networks gives us a better understanding of what is going on inside these mysterious huge networks. Beyond this application, Leon Gatys and his co-authors published a very interesting work called "A Neural Algorithm of Artistic Style", which uses neural representations to separate and recombine the content and style of arbitrary images, providing a neural algorithm for the creation of artistic images.

Define content and style representations

Use the intermediate layers of the model to get the content and style representations of the image. Starting from the network's input layer, the first few layers' activations represent low-level features like edges and textures. As you step through the network, the final few layers represent higher-level features—object parts like wheels or eyes. In this case, you are using the VGG19 network architecture, a pretrained image classification network. These intermediate layers are necessary to define the representation of content and style from the images. For an input image, try to match the corresponding style and content target representations at these intermediate layers.

Now load a VGG19 without the classification head, and list the layer names.

At a high level, in order for a network to perform image classification (which this network has been trained to do), it must understand the image. This requires taking the raw image as input pixels and building an internal representation that converts the raw image pixels into a complex understanding of the features present within the image.

This is also a reason why convolutional neural networks are able to generalize well: they’re able to capture the invariances and defining features within classes (e.g. cats vs. dogs) that are agnostic to background noise and other nuisances. Thus, between the raw image fed into the model and the output, in this case, the predicted label, the model serves as a complex feature extractor. By accessing intermediate layers of the model, you're able to describe the content and style of input images.

Build the model

The networks in tf.keras.applications are designed so you can easily extract the intermediate layer values using the Keras functional API.

To define a model using the functional API, specify the inputs and outputs:

model = Model(inputs, outputs)
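For instance, a style/content extractor over VGG19 might look like this; the specific layer choices follow common practice for this task and are not the only valid ones:

```python
import tensorflow as tf

# Layer choices commonly used for style transfer with VGG19.
content_layers = ['block5_conv2']
style_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                'block4_conv1', 'block5_conv1']

def vgg_layers(layer_names):
    """Create a model that returns the activations of the given VGG19 layers."""
    vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
    vgg.trainable = False
    outputs = [vgg.get_layer(name).output for name in layer_names]
    return tf.keras.Model(inputs=vgg.input, outputs=outputs)

extractor = vgg_layers(style_layers + content_layers)
outs = extractor(tf.random.uniform((1, 224, 224, 3)))
```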

Calculate style

The style of an image can be described by the means and correlations across the different feature maps. Calculate a Gram matrix that includes this information by taking the outer product of the feature vector with itself at each location, and averaging that outer product over all locations. This Gram matrix can be calculated for a particular layer as:

$G^{l}_{cd} = \cfrac{\sum_{ij}F^{l}_{ijc}(x)F^{l}_{ijd}(x)}{IJ}$
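A sketch of this computation with `tf.linalg.einsum` (the function name `gram_matrix` is an assumption):

```python
import tensorflow as tf

def gram_matrix(input_tensor):
    """Gram matrix of a feature map of shape (batch, I, J, C)."""
    # Sum F_ijc * F_ijd over all spatial locations (i, j) ...
    result = tf.linalg.einsum('bijc,bijd->bcd', input_tensor, input_tensor)
    # ... then divide by the number of locations I*J.
    shape = tf.shape(input_tensor)
    num_locations = tf.cast(shape[1] * shape[2], tf.float32)
    return result / num_locations
```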

Extract style and content

Define loss

Our goal is to synthesize an output image that simultaneously matches the content features of the photograph and the style features of the respective piece of art. How can we do that? We can define the loss function as the composition of:

  1. The dissimilarity of the content features between the output image and the content image
  2. The dissimilarity of the style features between the output image and the style image

The following figure gives a very good visualization of the process:

Run gradient descent

Let's train with more iterations to see the results!

Total variation loss

One downside to this basic implementation is that it produces a lot of high frequency artifacts. Decrease these using an explicit regularization term on the high frequency components of the image. In style transfer, this is often called the total variation loss:

$V(y)=\sum_i \sum_j\sqrt{(y_{i+1,j}-y_{i,j})^2 + (y_{i,j+1}-y_{i,j})^2}$

In practice, to speed up the computation, we implement the following version instead:

$V(y)=\sum_i \sum_j|y_{i+1,j}-y_{i,j}| + |y_{i,j+1}-y_{i,j}|$
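One possible sketch of this anisotropic version (the function name is an assumption):

```python
import tensorflow as tf

def total_variation_loss(img):
    """Anisotropic total variation of a batch of images (batch, H, W, C):
    the sum of absolute differences between horizontally and vertically
    neighboring pixels."""
    x_diff = img[:, :, 1:, :] - img[:, :, :-1, :]
    y_diff = img[:, 1:, :, :] - img[:, :-1, :, :]
    return tf.reduce_sum(tf.abs(x_diff)) + tf.reduce_sum(tf.abs(y_diff))
```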

This shows how the high frequency components have increased. Also, this high frequency component is basically an edge-detector. You can get similar output from the Sobel edge detector, for example:

Re-run the optimization

With total variation loss, the image has better quality and really looks like a masterpiece of Vincent van Gogh, right?

AdaIN

The method mentioned above requires a slow iterative optimization process, which limits its practical application. Xun Huang and Serge Belongie from Cornell University proposed another framework that enables arbitrary style transfer in real time, known as "Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization".

AdaIN can transfer arbitrary new styles in real time, combining the flexibility of the optimization-based framework with a speed similar to the fastest feed-forward approaches. At the heart of this method is a novel adaptive instance normalization (AdaIN) layer that aligns the mean and variance of the content features with those of the style features. Instance normalization performs style normalization by normalizing feature statistics, which earlier works found to carry the style information of an image. A decoder network is then learned to generate the final stylized image by inverting the AdaIN output back to image space.

Here we use the MSCOCO 2014 test set as our content dataset and the WikiArt test set as our style dataset, containing 40,736 and 23,585 images respectively. To prevent misunderstanding, we should clarify why we use the test sets instead of the training sets: the full MSCOCO 2014 and WikiArt datasets total more than 45 GB, which would be too heavy for this tutorial. In addition, our purpose is to train a style transfer model rather than an image classifier or object detector, so using the test sets is nothing to worry about.

Dataset API

Before creating the dataset API, we first have to remove some unwanted data, such as small images or grayscale images.

VGG19 was trained with Caffe, which converts images from RGB to BGR and then zero-centers each color channel with respect to the ImageNet dataset, without scaling. Therefore, we have to do the same thing during data preprocessing.
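A sketch of this Caffe-style preprocessing and its inverse; the function names are assumptions, while the per-channel constants are the standard ImageNet BGR means:

```python
import tensorflow as tf

# ImageNet per-channel means in BGR order (the Caffe convention).
IMAGENET_MEAN_BGR = tf.constant([103.939, 116.779, 123.68])

def vgg_preprocess(img):
    """img: float tensor in RGB order with values in [0, 255]."""
    img = img[..., ::-1]            # RGB -> BGR
    return img - IMAGENET_MEAN_BGR  # zero-center each channel, no scaling

def vgg_deprocess(img):
    """Invert the preprocessing to recover an RGB image in [0, 255]."""
    img = img + IMAGENET_MEAN_BGR
    return img[..., ::-1]           # BGR -> RGB
```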

Adaptive Instance Normalization

AdaIN receives a content input $x$ and a style input $y$, and simply aligns the channel-wise mean and variance of $x$ to match those of $y$. It is worth knowing that, unlike BN (batch normalization), IN (instance normalization), or CIN (conditional instance normalization), AdaIN has no learnable affine parameters. Instead, it adaptively computes the affine parameters from the style input:

$\text{AdaIN}(x,\,y) = \sigma(y)\,(\cfrac{x - \mu(x)}{\sigma(x)}) + \mu(y)$
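A direct sketch of this operation using `tf.nn.moments`; the small epsilon for numerical stability is an assumption:

```python
import tensorflow as tf

def adain(content_feat, style_feat, eps=1e-5):
    """Align the channel-wise mean/std of content features with the style's.
    Both inputs are feature maps of shape (batch, H, W, C)."""
    c_mean, c_var = tf.nn.moments(content_feat, axes=[1, 2], keepdims=True)
    s_mean, s_var = tf.nn.moments(style_feat, axes=[1, 2], keepdims=True)
    normalized = (content_feat - c_mean) / tf.sqrt(c_var + eps)
    return tf.sqrt(s_var + eps) * normalized + s_mean
```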

Model

We use the first few layers of a fixed VGG-19 network to encode the content and style images. An AdaIN layer performs style transfer in the feature space, and a decoder is learned to invert the AdaIN output back to image space. Moreover, we use the same VGG encoder to compute a content loss $L_c$ and a style loss $L_s$. Here we define $t$ as the output of the AdaIN layer.

$t = \text{AdaIN}(f(c),\,f(s))$.

Next, we define the loss function, which is composed of a content loss and a style loss, where $\lambda$ is a weighting factor.
$L = L_c + \lambda L_s$.

The content loss is the Euclidean distance between the target features and the features of the output image. We use the AdaIN output $t$ as the content target, instead of the commonly used feature responses of the content image. The authors found this leads to slightly faster convergence and also aligns with the goal of inverting the AdaIN output $t$.
$L_c = \|\,f(g(t)) - t\,\|_2$

Since the AdaIN layer only transfers the mean and standard deviation of the style features, our style loss only matches these statistics. Although the commonly used Gram matrix loss can produce similar results, we match the IN statistics because they are conceptually cleaner. This style loss has also been explored by [Li et al](https://arxiv.org/pdf/1701.01036.pdf).
$L_s = \sum\limits_{i=1}^{L}\|\,\mu(\phi_i(g(t)))-\mu(\phi_i(s))\,\|_2 + \sum\limits_{i=1}^{L}\|\,\sigma(\phi_i(g(t)))-\sigma(\phi_i(s))\,\|_2$

Training

Testing

NTHU Example

One of the most important advantages of AdaIN is speed. Earlier we implemented iterative style transfer, which takes roughly 30 seconds per image on an Nvidia GeForce RTX 2080 Ti, whereas AdaIN is up to three orders of magnitude faster. Here we demonstrate the power of AdaIN with a single content image and 25 distinct styles.

Save and Load Models

Model progress can be saved during and after training. This means a model can resume where it left off and avoid long training times. Saving also means you can share your model and others can recreate your work. When publishing research models and techniques, most machine learning practitioners share:

Sharing this data helps others understand how the model works and try it themselves with new data.

The phrase "saving a TensorFlow model" typically means one of two things:

  1. Checkpoints, OR
  2. SavedModel.

Checkpoints capture the exact value of all parameters (tf.Variable objects) used by a model. Checkpoints do not contain any description of the computation defined by the model and thus are typically only useful when source code that will use the saved parameter values is available.

The SavedModel format on the other hand includes a serialized description of the computation defined by the model in addition to the parameter values (checkpoint). Models in this format are independent of the source code that created the model. They are thus suitable for deployment via TensorFlow Serving, TensorFlow Lite, TensorFlow.js, or programs in other programming languages (the C, C++, Java, Go, Rust, C# etc. TensorFlow APIs). Saving a fully-functional model is very useful - you can load them in TensorFlow.js and then train and run them in web browsers, or convert them to run on mobile devices using TensorFlow Lite.
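A minimal end-to-end example with a toy `tf.Module`; the module and the path are illustrative:

```python
import tensorflow as tf

class Adder(tf.Module):
    """A tiny model; the SavedModel stores its computation graph and variables."""
    def __init__(self):
        super().__init__()
        self.bias = tf.Variable(1.0)

    @tf.function(input_signature=[tf.TensorSpec(shape=None, dtype=tf.float32)])
    def __call__(self, x):
        return x + self.bias

model = Adder()
tf.saved_model.save(model, '/tmp/adder_savedmodel')

# The restored object is independent of the Adder source code above.
restored = tf.saved_model.load('/tmp/adder_savedmodel')
result = restored(tf.constant(2.0))
print(result.numpy())  # 3.0
```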

Example of the graph defined by the model, which is visualized by TensorBoard.

Inside checkpoints

Before starting this tutorial, you should know what kinds of information are stored in the checkpoints. There are various parameters used by the model, including hyperparameters, weights and optimizer slot variables. TensorFlow matches variables to checkpointed values by traversing a directed graph with named edges, starting from the object being loaded. Edge names typically come from attribute names in objects.

The dependency graph looks like this:

With the optimizer in red, regular variables in blue, and optimizer slot variables in orange. The other nodes, for example representing the tf.train.Checkpoint, are black.

Slot variables are part of the optimizer's state, but are created for a specific variable. For example the 'm' edges above correspond to momentum, which the Adam optimizer tracks for each variable. Slot variables are only saved in a checkpoint if the variable and the optimizer would both be saved, thus the dashed edges.

This tutorial covers APIs for writing and reading checkpoints. For more information about SavedModel API, see Using the SavedModel format and Save and load models guides.

There are several ways to save TensorFlow models, depending on the API you are using. In this section, we are going to demonstrate

For simplicity, here we use the MNIST dataset to demonstrate how to save and load weights.

Save checkpoints during training

You can use a trained model without having to retrain it, or pick up training where you left off in case the training process was interrupted. The tf.keras.callbacks.ModelCheckpoint callback allows you to continually save the model both during and at the end of training; this method saves all parameters used by the model, including weights and optimizer state. The callback provides several options to create unique names for checkpoints and to adjust the checkpointing frequency.

Checkpoint callback usage
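A sketch of the callback in action on a stand-in model and random data (the tutorial itself uses MNIST; the model, data, and filepath are assumptions):

```python
import numpy as np
import tensorflow as tf

def make_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(10),
    ])

model = make_model()
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

x = np.random.rand(256, 784).astype('float32')
y = np.random.randint(0, 10, 256)

# Save the weights at the end of every epoch.
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='cp.weights.h5',   # assumed path; epoch patterns also work
    save_weights_only=True,
    verbose=1)

model.fit(x, y, epochs=2, callbacks=[cp_callback], verbose=0)

# A fresh model with the same architecture can restore the weights.
model2 = make_model()
model2.build((None, 784))
model2.load_weights('cp.weights.h5')
```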

Now rebuild a fresh, untrained model, and evaluate it on the test set. An untrained model will perform at chance levels (~10% accuracy):

Create a new, untrained model. When restoring a model from weights-only, you must have a model with the same architecture as the original model. Since it's the same model architecture, you can share weights despite that it's a different instance of the model.

After loading the weights from the checkpoint, we can re-evaluate the model. As you can see, the accuracy rises to 85.5%, the same as the model we trained earlier.

What are these files?

The above code stores the weights in a collection of checkpoint-formatted files that contain only the trained weights in a binary format. Checkpoints contain: one or more shards that hold your model's weights, and an index file that indicates which weights are stored in which shard.

Manually save weights

We just demonstrated how to save and load weights when using Model.fit. Manually saving them is just as simple with the Model.save_weights method, which is quite useful during custom training. In our Deep Learning course, most assignments and competitions require custom training.

Another thing you should notice is the difference between tf.keras.callbacks.ModelCheckpoint and Model.save_weights. The former saves all parameters used in the model, including weights and optimizer state, while the latter saves only the weights; no information about the optimizer is saved. Therefore, if you restore a checkpoint saved by Model.save_weights, it is not possible to pick up training exactly where you left off. Fortunately, in most cases, the optimizer state is not that important compared to the weights. In addition, since Model.save_weights only stores weights, its checkpoint files are lighter than those created by tf.keras.callbacks.ModelCheckpoint.

Here you might encounter a FailedPreconditionError. Please re-run the cell containing def train_step and def test_step, and then run the following cells.

Manually checkpointing

Another way to save checkpoints during custom training is to use the tf.train.Checkpoint API, which captures the exact value of all parameters used by the model. Unlike Model.save_weights, it saves the optimizer state as well as the weights.

To manually make a checkpoint, you will need a tf.train.Checkpoint object, where the objects you want to checkpoint are set as attributes on it.

A tf.train.CheckpointManager can also be helpful for managing multiple checkpoints.
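A sketch of the typical `Checkpoint` + `CheckpointManager` pattern; the toy model, optimizer, and directory are assumptions:

```python
import tensorflow as tf

net = tf.keras.Sequential([tf.keras.layers.Dense(4)])
net.build((None, 2))
opt = tf.keras.optimizers.Adam(0.1)

# The attributes of the Checkpoint object define what gets saved.
ckpt = tf.train.Checkpoint(step=tf.Variable(0), optimizer=opt, model=net)
manager = tf.train.CheckpointManager(ckpt, './tf_ckpts', max_to_keep=3)

# Restore the latest checkpoint if one exists, otherwise start from scratch.
ckpt.restore(manager.latest_checkpoint)
if manager.latest_checkpoint:
    print('Restored from', manager.latest_checkpoint)
else:
    print('Initializing from scratch.')

# Inside the training loop, bump the step and save periodically:
ckpt.step.assign_add(1)
save_path = manager.save()
print('Saved checkpoint:', save_path)
```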

After the first run, you can pass a new model and manager and pick up training exactly where you left off.

For more information about tf.train.Checkpoint and the detailed structure of saved checkpoints, please check Training checkpoints guide.

Reference

Assignment

In this assignment, you need to do the following:

Part I (A Neural Algorithm of Artistic Style)

  1. Implement the total variation loss. tf.image.total_variation is not allowed (10%).
  2. Change the weights for the style, content, and total variation losses (10%).
  3. Use other layers in the model (10%).
    • You need to calculate both content loss and style loss from different layers in the model
  4. Write a brief report. Explain how the results are affected when you change the weights or use different layers for calculating the loss (10%).
    • Insert markdown cells in the notebook to write the report.

Part II (AdaIN)

  1. Implement AdaIN layer and use single content image to create 25 images with different styles (60%).

You can download WikiArt and MSCOCO 2014 from here.

Requirements: